Using the Kriging Correlation for unsupervised feature selection problems

This paper proposes a KC Score to measure feature importance in clustering analysis of high-dimensional data. The KC Score evaluates the contribution of each feature based on the correlation between the original features and the features reconstructed from the low-dimensional latent space. A KC Score-based feature selection strategy is further developed for clustering analysis. We investigate the performance of the proposed strategy in a study of four single-cell RNA sequencing (scRNA-seq) datasets. The results show that our strategy effectively selects important features for clustering. In particular, in three of the datasets, the proposed strategy selected fewer than 5% of the features and achieved the same or better clustering performance than when using all of the features.

For each gene, we calculate its Laplacian Score from S^M(X) and its KC Score from Z^M(X); details are given in the "Methods" section. Let V_k be the collection of the k features with the highest KC Scores; hence V_k is an n × k submatrix of X. Let S^G(V_k) be the similarity matrix estimated by the single Gaussian kernel method based on V_k, where the superscript G indicates results generated by the single Gaussian kernel method. Our goal is to use the KC Scores to find the genes that play important roles in clustering. The details (Strategy A) are given below.
[Strategy A]

1. For each k = 1, ..., p, apply t-SNE to V_k with S^G(V_k) and denote the associated latent space projection by Z^G(V_k). Then apply k-means to Z^G(V_k) to obtain the clustering label vector ŷ^G(V_k).
2. For each k = 1, ..., p, calculate Pillai's trace in MANOVA for ŷ^G(V_k) and Z^G(V_k) and denote the result by F(V_k). Let k_1 = argmax_{k=1,...,p} F(V_k).
3. To further prune V_{k_1}, let k_2 be the smallest k such that NMI(ŷ^G(V_k), ŷ^G(V_{k_1})) > 0.95 for all k ∈ [k_2, k_1].
4. If NMI(ŷ^G(V_{k_2}), ŷ^M(X)) > 0.7, then output V_{k_2} and ŷ^G(V_{k_2}); otherwise, repeat steps 1-3 with the superscript G replaced by M and output V_{k_2} and ŷ^M(V_{k_2}).

Figure 1 presents the flow chart of the above procedure. In step 1, the reasons for using S^G are twofold: S^G requires lower computational cost than S^M, and the clustering performances of S^G and S^M are comparable when the number of critical genes (k_2 in step 3) is small. In step 2, we determine the initial feature set V_{k_1} by maximizing Pillai's trace statistic in MANOVA, which corresponds to the ratio of between-group to within-group variance. Since V_{k_1} may still contain irrelevant or non-significant features, we adopt a pruning step in Strategy A. Pruning is a widely used technique for refining selected features in model selection [6] and in tree-based machine learning methods [7]; it aims to reduce variance and avoid overfitting by deleting irrelevant or non-significant features. We therefore prune the set V_{k_1} to V_{k_2} in step 3 such that, after dropping unimportant genes from V_{k_1}, the NMIs between ŷ^G(V_{k_1}) and ŷ^G(V_k) remain greater than 0.95 for all k ∈ [k_2, k_1]. In step 4, we check the adequacy of V_{k_2} by comparing the NMI between ŷ^G(V_{k_2}) and ŷ^M(X), which decides whether to replace the superscript G by M in steps 1-3. Similarly, to find the set of critical genes selected by the Laplacian Scores, denoted V*_{k_2}, we simply replace the KC Scores with the Laplacian Scores in the above procedure.
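For concreteness, the following is a minimal sketch of steps 1-2 of Strategy A in Python. It is illustrative only: it uses scikit-learn's Euclidean t-SNE in place of the paper's similarity-matrix-based embedding with S^G(V_k), and the function names (pillai_trace, strategy_a_steps12) and the candidate grid ks are our own. Pillai's trace is computed directly as tr(H(H + E)^{-1}) from the between-group (H) and within-group (E) sums-of-squares-and-cross-products matrices.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def pillai_trace(Z, labels):
    """Pillai's trace tr(H (H + E)^{-1}), where H and E are the
    between-group and within-group SSCP matrices of Z."""
    grand = Z.mean(axis=0)
    d = Z.shape[1]
    H, E = np.zeros((d, d)), np.zeros((d, d))
    for g in np.unique(labels):
        Zg = Z[labels == g]
        diff = (Zg.mean(axis=0) - grand)[:, None]
        H += len(Zg) * (diff @ diff.T)
        R = Zg - Zg.mean(axis=0)
        E += R.T @ R
    return np.trace(H @ np.linalg.inv(H + E))

def strategy_a_steps12(X, kc_scores, n_clusters, ks):
    """Steps 1-2: embed the top-k features, cluster, score each k with
    Pillai's trace, and return k_1 = argmax_k F(V_k)."""
    order = np.argsort(kc_scores)[::-1]              # genes ranked by KC Score
    results = {}
    for k in ks:
        Vk = X[:, order[:k]]                         # V_k: top-k feature submatrix
        Zk = TSNE(n_components=2).fit_transform(Vk)  # latent projection of V_k
        yhat = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Zk)
        results[k] = (pillai_trace(Zk, yhat), Zk, yhat)
    k1 = max(results, key=lambda k: results[k][0])
    return k1, results
```

Step 3 would then scan downward from k_1 and keep the smallest k whose clustering ŷ^G(V_k) stays within NMI > 0.95 of ŷ^G(V_{k_1}).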
Performance of Strategy A. We mainly use two metrics, NNA [2] (in a supervised setting) and NMI [2] (in an unsupervised setting), to compare the classification and clustering performance based on the latent space projections Z^l(V_{k_2}) and Z^l(V*_{k_2}), where l = G or M. Since a good latent space projection should facilitate distance-based classifiers, NNA measures the goodness of the distance induced by the latent space projection Z^l(V_{k_2}) or Z^l(V*_{k_2}). When NNA(Z^l(V_{k_2})) is greater than NNA(Z^l(V*_{k_2})), the latent space projection Z^l(V_{k_2}) is more effective for classification than Z^l(V*_{k_2}), and vice versa. We use NMI to evaluate the consistency between the obtained clustering ŷ^l(V_{k_2}) or ŷ^l(V*_{k_2}) and the true labels; a higher NMI indicates a better clustering result. In addition to NNA in the supervised setting, we also adopt a random forest to evaluate the classification performance of the different methods: we calculate a random forest classifier's average classification accuracy, denoted RFA, under a 5-fold cross-validation framework. Table 2 reports the number of critical genes k_2 after pruning, the ratios k_2/k_1 and k_2/p, and the corresponding NNA, RFA, and NMI based on Z^l(V_{k_2}), Z^l(V*_{k_2}), and Z^M(X) for the four datasets. The results in Table 2 show that, in most cases, Strategy A selects only a small percentage (k_2/p) of the features but achieves NNA, RFA, and NMI comparable to or even better than using all features. In terms of k_2/k_1, Strategy A pruned over 25% of the features in V_{k_1} for the mECS, Kolod, and Usoskin datasets. In particular, for the Kolod dataset, Strategy A requires only 2 genes, accounting for 0.01% of the features, to achieve the same performance as using all features. Also, for the Usoskin dataset, the NNA, RFA, and NMI based on the KC and Laplacian Scores are higher than the benchmark.
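As an illustration of how RFA can be computed, the short sketch below uses scikit-learn's random forest under 5-fold cross-validation on a latent projection Z and true labels y; the function name rfa and the forest size are our own choices, not values specified by the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rfa(Z, y, seed=0):
    """RFA: random forest accuracy averaged over 5-fold cross-validation."""
    clf = RandomForestClassifier(n_estimators=500, random_state=seed)
    return cross_val_score(clf, Z, y, cv=5, scoring="accuracy").mean()
```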
To further evaluate the performance of the KC and Laplacian Scores, we compare their NMIs at the two values of k_2 selected by the KC and Laplacian Scores, respectively, in Table 2; see Fig. 2. The results show that the KC Score attains comparable or better NMI than the Laplacian Score on all four datasets for both values of k_2.

Comparison of single and multi-resolution Gaussian kernels. Figure 3 illustrates why the first step of Strategy A adopts the similarity matrix (S^G) estimated by the single Gaussian kernel method rather than the one (S^M) estimated by multi-resolution Gaussian kernels. Figure 3 presents the curves of NNA in (a)-(d) and NMI in (e)-(h) versus the feature size (log scale) for the four datasets. Each subfigure plots the curves of the KC Score with a single Gaussian kernel (red), the KC Score with SIMLR (yellow), the Laplacian Score with a single Gaussian kernel (blue), and the Laplacian Score with SIMLR (green). For the Kolod, Pollen, and Usoskin datasets, the KC and Laplacian Scores with a single Gaussian kernel perform better than their counterparts with SIMLR. The KC Score with a single Gaussian kernel reaches the highest NNA and NMI much faster than the other three methods, especially for the Kolod dataset, where the corresponding highest NNA and NMI occur at k_o = 2; here k_o denotes the minimal k at which the NNA or NMI of each method attains its highest peak in Fig. 3. In contrast, the KC and Laplacian Scores with SIMLR perform better than the single Gaussian kernel for the mECS dataset, and the highest NNA and NMI of the KC Score with SIMLR were reached at k_o > 3000, larger than the k_o values of the other three methods. Hence, the KC Score with a single Gaussian kernel is recommended when the number of critical genes is small, while the KC Score with SIMLR is recommended when the number is large. One possible explanation of this phenomenon is that multi-resolution Gaussian kernels are designed to collect a larger set of features than a single Gaussian kernel, since different kernels may highlight different critical features; if the number of features helpful for classification is small, a single kernel may suffice to identify these genes. This may explain why our numerical experiments reveal that the KC Score with a single Gaussian kernel performs better than the multi-kernel approach when k_o is small. Moreover, each subfigure in Fig. 3 is also marked with the k_2 values of the KC Score and the Laplacian Score obtained from Table 2. The NMI and NNA at k_2 are higher than those at most other values of k in each dataset, which indicates that the k_2 features recommended by Strategy A produce satisfactory clustering performance for the four datasets.

Discussion
This study proposes a KC Score to measure feature importance. The KC Score is designed by measuring the correlation between the original gene expression and the associated reconstructed gene expression based on the latent space obtained from SIMLR and t-SNE. A feature selection strategy is also developed, with either the KC Score or the Laplacian Score, to select the critical genes. The strategy is applied to four datasets. The results show that, when there are few critical genes, the latent space based on the KC Score and a single Gaussian kernel has the best performance. In contrast, the latent space based on SIMLR is recommended when the number of critical genes is relatively large. In particular, for the Kolod dataset, Strategy A with the KC Score selects only two critical genes and achieves perfect clustering, meaning that the corresponding NMI is 1. To gain more insight into how these two selected critical genes produce perfect clustering, Fig. 4 shows the scatter plot of the expressions of the two critical genes (the 9708th and 11221st). In Fig. 4, if both genes' expressions exhibit dropout, the cells are classified as Class 1 (red points); if only the 9708th gene's expression is dropped out, the cells are classified as Class 2 (green points); and if neither gene's expression exhibits dropout, the cells are classified as Class 3 (blue points). As the figure shows, the three classes are separated perfectly by these three dropout patterns. This phenomenon shows that dropout patterns can provide helpful and informative signals for scRNA-seq clustering. Other studies in the literature have reported similar findings that dropout patterns can be a helpful signal in single-cell data analysis [8-10]. Nonetheless, dropout patterns may be less informative when they are very dispersed, as is common in other areas such as microbiome data.
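The dropout-pattern rule described above can be written in a few lines of code. This is a hypothetical illustration: gene_a and gene_b stand for the expression vectors of the 9708th and 11221st genes, dropout is taken to mean a zero count, and the pattern where only the second gene drops out (not described in the text) is grouped with Class 3 by default.

```python
import numpy as np

def dropout_pattern_class(gene_a, gene_b):
    """Classify cells by the dropout (zero-count) pattern of two genes,
    mirroring the rule observed in Fig. 4 for the Kolod dataset."""
    a0, b0 = (gene_a == 0), (gene_b == 0)
    cls = np.full(gene_a.shape, 3)   # default: first gene expressed -> Class 3
    cls[a0 & b0] = 1                 # both genes dropped out        -> Class 1
    cls[a0 & ~b0] = 2                # only the first dropped out    -> Class 2
    return cls
```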
Moreover, concerning the NMIs in Table 2 for the Usoskin dataset, the clustering performance improves markedly, from 0.74 to over 0.93, after feature selection. To visualize this finding, Fig. 5 shows the projections and clustering results in the three 2-dimensional latent spaces obtained by t-SNE: (a) Z^G(V_65), (b) Z^G(V*_55), and (c) Z^M(X). In Fig. 5, the clustering performances of Z^G(V_65) and Z^G(V*_55) are comparable. However, compared with Z^G(V_65) and Z^G(V*_55), the separability among the four classes in Z^M(X) is relatively low, especially for the cells in Class 4 (purple points), which are incorrectly divided into two groups. This finding highlights that identifying and using critical features for clustering is more effective than using all features.

In the procedure of Strategy A, we adopted the unsupervised learning method SIMLR to cluster the cells. SIMLR requires many matrix computations of size n × n, where n denotes the number of cells, so the computational cost is very high for large n, which is a limitation of Strategy A. Table 3 presents the running hours spent in steps 1 and 2 of Strategy A when applying the different methods to the four datasets. All computations were conducted on servers with a 2.40 GHz CPU, an NVIDIA-SMI 430.50 GPU, and 126 GB RAM. One can see that the running hours of using SIMLR to compute the KC Score and Laplacian Score are roughly proportional to n^2. Hence, the running hours for the Kolod (n = 704) and Usoskin (n = 622) datasets increase dramatically compared with the other two datasets with smaller n. In addition, the running hours of SIMLR are substantially larger than those of the associated single Gaussian kernel method. Consequently, once the heavy computational burden of SIMLR for large n is overcome in the future, this limitation of Strategy A could be relieved.

Figure 6 presents the MANOVA Pillai's trace statistic curves for the four datasets. From the figure, one can see that the statistic F(V_k) is not a monotonic function of k. In particular, the curves tend to decrease once the number of features is large enough in the Kolod, Pollen, and Usoskin datasets. Moreover, the curves in Fig. 6 are not smooth around k_1 and fluctuate severely for k ≤ k_1 in the mECS, Pollen, and Usoskin datasets. Addressing this non-smoothness, for example with a more robust criterion for selecting k_1, is beyond the scope of this study, and we leave it to future work.

In addition, we investigate the performance of Strategy A on imbalanced data by delving into the most imbalanced of the four datasets, Pollen. The Pollen dataset consists of 11 classes, and one of the classes contains only 3% of the cells in the dataset. Figure 7 shows the confusion table of the clustering result of Z^G(V_{k_2}) in Table 2, where the clusters are rearranged to attain the highest accuracy. The results in Fig. 7 reveal that the sensitivity of the 5th class is indeed affected by its low proportion of cells in the dataset. In general, the classification problem for imbalanced data is significantly challenging, even in supervised learning.

KC Score. Step 1 fits a kriging regression of each gene's expression on the latent space projection, with the notation Z̃ := [1 Z] (a column of ones appended to Z), z̃ := (1, z), φ(z) := (φ_1(z), ..., φ_n(z))′, and Φ_ik = φ_k(z_i·); the full regression model is given in the "Methods" section. Step 2: the KC Score of gene k is defined as (KC)_k = Cor(x_·k, x̂_·k), where x̂_·k are the fitted values obtained from Step 1.
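To make the two-step definition concrete, here is a small sketch of a KC-type score in Python. It is an approximation under stated assumptions: a Gaussian-kernel ridge regression on the latent coordinates stands in for the paper's kriging regression, and the bandwidth, ridge penalty, and function name kc_scores are our own choices. The essential ingredient is the Pearson correlation between each gene's observed and reconstructed expression.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def kc_scores(X, Z, gamma=1.0, alpha=1e-3):
    """KC-type score per gene: correlation between observed expression
    x_{.k} and its reconstruction from the latent projection Z.
    Kernel ridge regression is used here as a stand-in for kriging."""
    scores = np.empty(X.shape[1])
    model = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha)
    for k in range(X.shape[1]):
        fitted = model.fit(Z, X[:, k]).predict(Z)   # x_hat_{.k}
        scores[k] = np.corrcoef(X[:, k], fitted)[0, 1]
    return scores
```

Any smoother of x on z could be substituted here; what matters for the score is how well a gene's expression can be reconstructed from the latent space.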

Evaluation criteria.

NNA
We consider a performance metric in the supervised setting, namely the Nearest Neighbor Accuracy (NNA). We use 5-fold cross-validation on the transformed matrix Z^l(V_j) and its true labels y, where l = G or M. In each trial, four folds are used as the training set and the remaining fold as the validation set. Each cell in the validation set is assigned the label of the training-set object with the smallest Euclidean distance to it. The NNA^l_j is defined as the average accuracy over the five validation sets.
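A minimal sketch of this metric, assuming a latent matrix Z and label vector y, is the 1-nearest-neighbor accuracy under 5-fold cross-validation in scikit-learn; the function name nna is ours.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def nna(Z, y):
    """NNA: 1-nearest-neighbor (Euclidean) accuracy averaged
    over 5-fold cross-validation."""
    clf = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
    return cross_val_score(clf, Z, y, cv=5, scoring="accuracy").mean()
```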

NMI
Normalized Mutual Information (NMI) is a measure of the consistency between two clusterings U = (U_1, ..., U_n) and V = (V_1, ..., V_n). Let P and Q denote the number of labels in U and V, respectively. Let n_pq = |{i = 1, ..., n : U_i = p, V_i = q}|, n_p+ = Σ_{q=1}^Q n_pq, and n_+q = Σ_{p=1}^P n_pq. The NMI is defined as

$$\mathrm{NMI}(U,V)=\frac{\sum_{p=1}^{P}\sum_{q=1}^{Q}\frac{n_{pq}}{n}\log\frac{n\,n_{pq}}{n_{p+}\,n_{+q}}}{\sqrt{H(U)\,H(V)}},$$

where $H(U)=-\sum_{p=1}^{P}\frac{n_{p+}}{n}\log\frac{n_{p+}}{n}$ and $H(V)=-\sum_{q=1}^{Q}\frac{n_{+q}}{n}\log\frac{n_{+q}}{n}$ are the entropies of the two clusterings.
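The following sketch computes NMI directly from the contingency counts n_pq as defined above. Assuming the geometric-mean normalization shown in the formula, it agrees with scikit-learn's normalized_mutual_info_score(U, V, average_method="geometric"); the function name nmi is ours.

```python
import numpy as np

def nmi(U, V):
    """NMI from contingency counts: mutual information normalized by
    the geometric mean of the two clustering entropies."""
    U, V = np.asarray(U), np.asarray(V)
    n = len(U)
    cu, cv = np.unique(U), np.unique(V)
    N = np.array([[np.sum((U == p) & (V == q)) for q in cv] for p in cu])
    pu, pv = N.sum(axis=1) / n, N.sum(axis=0) / n   # n_{p+}/n and n_{+q}/n
    Pj = N / n                                      # joint probabilities n_{pq}/n
    mask = Pj > 0
    mi = np.sum(Pj[mask] * np.log(Pj[mask] / np.outer(pu, pv)[mask]))
    hu = -np.sum(pu * np.log(pu))
    hv = -np.sum(pv * np.log(pv))
    return mi / np.sqrt(hu * hv)
```

For example, nmi(yhat, y_true) evaluates an obtained clustering against the true labels.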